Message classification using legitimate contact points

ABSTRACT

A system and method are disclosed for classifying a message. The method includes receiving the message, identifying all items of a certain type in the message, determining whether each of the items meets a criterion, and in the event that all the items are determined to meet the criterion, determining a classification of the message. The system includes an interface configured to receive the message, a processor coupled to the interface, configured to identify all items of a certain type in the message; determine whether each of the items meets a criterion; and in the event that all the items are determined to meet the criterion, determine a classification of the message.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation and claims the priority benefit of U.S. patent application Ser. No. 12/502,189 filed Jul. 13, 2009, which will issue as U.S. Pat. No. 8,108,477 and entitled “Message Classification Using Legitimate Contact Points,” which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 11/927,497 filed Oct. 29, 2007, now U.S. Pat. No. 7,562,122 and entitled “Message Classification Using Allowed Items,” which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 10/616,703 filed Jul. 9, 2003, now U.S. Pat. No. 7,406,502 and entitled “Message Classification Using Allowed Items,” which claims the priority benefit of U.S. Provisional Patent Application No. 60/476,419 filed Jun. 6, 2003 and entitled “A Method for Classifying Email Using White Content Thumbprints,” and which is a continuation in part of co-pending U.S. patent application Ser. No. 10/371,987 filed Feb. 20, 2003 and entitled “Using Distinguishing Properties to Classify Messages.”

FIELD OF THE INVENTION

The present invention relates generally to message classification. More specifically, a technique for avoiding junk messages (spam) is disclosed.

BACKGROUND OF THE INVENTION

Electronic messages have become an indispensable part of modern communication. Electronic messages such as email or instant messages are popular because they are fast, easy, and have essentially no incremental cost. Unfortunately, these advantages of electronic messages are also exploited by marketers who regularly send out unsolicited junk messages. The junk messages are referred to as “spam”, and spam senders are referred to as “spammers”. Spam messages are a nuisance for users. They clog people's inboxes, waste system resources, often promote distasteful subjects, and sometimes sponsor outright scams.

There are a number of commonly used techniques for classifying messages and identifying spam. For example, blacklists are sometimes used for tracking known spammers. The sender address of an incoming message is compared to the addresses in the blacklist. A match indicates that the message is spam and prevents the message from being delivered. Other techniques such as rule matching and content filtering analyze the message and determine the classification of the message according to the analysis. Some systems have multiple categories for message classification. For example, a system may classify a message as one of the following categories: spam, likely to be spam, likely to be good email, and good email, where only good email messages are allowed through and the rest are either further processed or discarded.

Spam-blocking systems sometimes misidentify non-spam messages. For example, a system that performs content filtering may be configured to identify any messages that include certain word patterns, such as “savings on airline tickets”, as spam. However, an electronic ticket confirmation message that happens to include such word patterns may be misidentified as spam or possibly spam. Misidentification of good messages is undesirable, since it wastes system resources, and in the worst case scenario, causes good messages to be classified as spam and lost.

It would be useful to have a technique that would more accurately identify non-spam messages. Such a technique would not be effective if spammers could easily alter parts of the spam messages they sent so that the messages would be identified as non-spam. Thus, it would also be desirable if non-spam messages identified by such a technique are not easily spoofed.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:

FIG. 1 is a flowchart illustrating the message classification process according to one embodiment.

FIG. 2 is a flowchart illustrating the details of the signature generation process according to one embodiment.

FIG. 3 is a flow chart illustrating the classification of a message according to another embodiment.

FIG. 4 is a flow chart illustrating a registration process for updating the database, according to one embodiment.

FIG. 5 is a table used for aggregating user inputs, according to one system embodiment.

DETAILED DESCRIPTION

It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, or a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. It should be noted that the order of the steps of disclosed processes may be altered within the scope of the invention.

A detailed description of one or more preferred embodiments of the invention is provided below along with accompanying figures that illustrate by way of example the principles of the invention. While the invention is described in connection with such embodiments, it should be understood that the invention is not limited to any embodiment. On the contrary, the scope of the invention is limited only by the appended claims and the invention encompasses numerous alternatives, modifications and equivalents. For the purpose of example, numerous specific details are set forth in the following description in order to provide a thorough understanding of the present invention. The present invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the present invention is not unnecessarily obscured.

In U.S. patent application Ser. No. 10/371,987 by Wilson, et al filed Feb. 20, 2003 entitled: “USING DISTINGUISHING PROPERTIES TO CLASSIFY MESSAGES” which is herein incorporated by reference for all purposes, a technique using distinguishing properties to identify electronic messages is described. The technique uses distinguishing properties within messages, such as contact information, to identify messages that have previously been classified. In some embodiments, the technique is applied to identify spam messages. However, spammers aware of such a detection scheme may change their contact information frequently to prevent their messages from being identified.

An improved technique is disclosed. The technique prevents spammers from circumventing detection by using items in the message to identify non-spam messages. All items of a certain type in the message are identified, and checked to determine whether they meet a certain criterion. In some embodiments, the items are distinguishing properties or signatures of distinguishing properties. They are identified and looked up in a database. In various embodiments, the database may be updated by a registration process, based on user input, and/or post-processing stored messages. In some embodiments, the items are looked up in a database of acceptable items. A message is classified as non-spam if all the items are found in the database. If not all the items are found in the database, the message is further processed to determine its classification.

Spammers generally have some motives for sending spam messages. Although spam messages come in all kinds of forms and contain different types of information, nearly all of them contain some distinguishing properties for helping the senders fulfill their goals. For example, in order for the spammer to ever make money from a recipient, there must be some way for the recipient to contact the spammer. Thus, most spam messages include at least one contact point, whether in the form of a phone number, an address, a universal resource locator (URL), or any other appropriate information for establishing contact with some entity. These distinguishing properties, such as contact points, instructions for performing certain tasks, distinctive terms such as stock ticker symbols, names of products or companies, or any other information essential for the message, are extracted and used to identify messages.

Similarly, non-spam messages may also have distinguishing properties. For example, electronic ticket confirmations and online purchase orders commonly include contact points such as URLs, email addresses, and telephone numbers to the sender's organization. It is advantageous that spam messages always include some distinguishing properties that are different from the distinguishing properties in non-spam messages. For example, the URL to the spammer's website is unlikely to appear in any non-spam message. To identify non-spam messages, a database is used for storing acceptable distinguishing properties. The database may be a table, a list, or any other appropriate combination of storage software and hardware. A message that only has acceptable distinguishing properties is unlikely to be spam. Since information that is not distinguishing is discarded during the classification process, it is more difficult for the spammers to alter their message generation scheme to evade detection.

For the purpose of example, details of email message processing using contact points and contact point signatures to determine whether the message is acceptable are discussed, although it should be noted that the technique is also applicable to the classification of other forms of electronic messages using other types of items. It should also be noted that different types of criteria and classification may be used in various embodiments.

FIG. 1 is a flowchart illustrating the message classification process according to one embodiment. A message is received (100), and all the contact points are selected (102). It is then determined whether all the contact points can be found in a database of previously stored acceptable contact points (104). If all the contact points are found in the database, the message is classified as non-spam and delivered to the user (106). The contact points that are not found in the database may be contact points for a spammer or contact points for a legitimate sender that have not yet been stored in the database. Thus, if not all contact points are found in the database, the message cannot be classified as non-spam and further processing is needed to accurately classify the message (108). The processing may include any appropriate message classification techniques, such as performing a whitelist test on the sender's address, using summary information or rules to determine whether the content of the message is acceptable, etc.
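For the purpose of illustration only, the decision of FIG. 1 may be sketched as follows. The function name, the return labels, and the use of an in-memory set as the database are assumptions made for this sketch and are not part of the disclosure. FIG. 1 does not state how a message with no contact points is handled; the sketch sends such a message to further processing, which is likewise an assumption.

```python
def classify_by_contact_points(contact_points, acceptable_points):
    """Classify a message from its already-selected contact points (step 102).

    `acceptable_points` stands in for the database of previously stored
    acceptable contact points; a plain set is assumed here.
    """
    points = list(contact_points)
    # Assumption: a message without contact points cannot be cleared here.
    if not points:
        return "further-processing"
    # Step 104: every contact point must be found in the database.
    if all(point in acceptable_points for point in points):
        return "non-spam"  # step 106: deliver to the user
    # Step 108: at least one unknown contact point, so other techniques
    # (whitelists, rules, content analysis) must decide.
    return "further-processing"
```

A single unknown contact point is enough to withhold the non-spam classification, which is what makes the check difficult to spoof by padding a message with legitimate contact points.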

In some embodiments, the system optionally generates signatures based on the selected contact points. The signatures can be generated using a variety of methods, including compression, expansion, checksum, hash functions, etc. The signatures are looked up in a database of acceptable signatures. If all the signatures are found in the database, the message is classified as non-spam; otherwise, the message is further processed to determine its classification. Since signatures obfuscate the actual contact point information, using signatures provides better privacy protection for the intended recipient of the message, especially when the classification component resides on a different device than the recipient's.

FIG. 2 is a flowchart illustrating the details of the signature generation process according to one embodiment. Various contact points are extracted from the message and used to generate the signatures. This process is used both in classifying incoming messages and in updating the database with signatures that are known to be from non-spam. The sender address, email addresses, links to URLs such as web pages, images, etc. and the phone numbers in the message are extracted (200, 202, 204, 206). There are many ways to extract the contact information. For example, telephone numbers usually include 7-10 digits, sometimes separated by dashes and parentheses. To extract telephone numbers, the text of the message is scanned, and patterns that match various telephone number formats are extracted. Any other appropriate contact information is also extracted (208).
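As one possible illustration of the pattern-based extraction described above, the following sketch scans message text with regular expressions. The patterns, the function name, and the returned dictionary layout are assumptions chosen for this example; they cover only a few North American phone formats (with an optional leading 1) and simple email and URL forms, and a practical system would need broader patterns.

```python
import re

# Hypothetical patterns for illustration; not an exhaustive set of formats.
PHONE = re.compile(
    r"(?<!\w)(?:(?:1[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?)?\d{3}[-.\s]?\d{4}(?!\w)"
)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def extract_contact_points(text):
    """Return the phone numbers, email addresses and URLs found in `text`."""
    # Trailing sentence punctuation is not part of a URL, so strip it.
    urls = [match.rstrip(".,;:!?)") for match in URL.findall(text)]
    return {
        "phones": PHONE.findall(text),
        "emails": EMAIL.findall(text),
        "urls": urls,
    }


sample = (
    "Call 1-800-555-5555 or 1(800)555-5555. "
    "Write to jon@mailfrontier.com or see http://www.mailfrontier.com/contact."
)
found = extract_contact_points(sample)
```

The extracted strings are still in whatever format the sender used; they are not comparable until they have been reduced to a common form.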

The extracted contact points are then reduced to their canonical equivalents (210). The canonical equivalent of a piece of information is an identifier used to represent the same information, regardless of its format. For example, a telephone number may be represented as 1-800-555-5555 or 1(800)555-5555, but both are reduced to the same canonical equivalent of 18005555555. In some embodiments, the canonical equivalent of a URL and an email address is the domain name. For example, http://www.mailfrontier.com/contact, www.mailfrontier.com/support and jon@mailfrontier.com are all reduced to the same canonical equivalent of mailfrontier.com. It should be noted that there are numerous techniques for arriving at the canonical equivalent of any distinguishing property, and different implementations may employ different techniques.
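The reductions in the examples above may be sketched as follows. The function names are hypothetical, and stripping only a leading "www." label is a simplifying assumption: reducing an arbitrary host name to its registered domain would in practice require additional data such as a public suffix list.

```python
import re
from urllib.parse import urlsplit


def canonical_phone(raw):
    """Keep only the digits, so every punctuation style maps to one string."""
    return re.sub(r"\D", "", raw)


def canonical_domain(raw):
    """Reduce a URL or an email address to a lower-case domain name."""
    if "@" in raw and "://" not in raw:
        # Email address: the domain is everything after the last "@".
        host = raw.rsplit("@", 1)[1]
    else:
        # urlsplit only recognises the host when a scheme or "//" is present.
        candidate = raw if "://" in raw else "//" + raw
        host = urlsplit(candidate).hostname or ""
    host = host.lower()
    # Assumption: dropping a leading "www." is enough for this illustration.
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

With these helpers, `canonical_phone("1-800-555-5555")` and `canonical_phone("1(800)555-5555")` both yield `18005555555`, and the three mailfrontier.com examples all yield `mailfrontier.com`.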

After the contact points are reduced to their canonical equivalents, signatures corresponding to the canonical equivalents are generated and added to the database (212). There are various techniques for generating the signature, such as performing a hash function or a checksum function on the characters in the canonical equivalent.
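A minimal sketch of step 212 follows. The description does not name a particular hash or checksum function; SHA-256 from the Python standard library is used here purely as an assumption, and the function names and the set-based database are likewise hypothetical.

```python
import hashlib


def make_signature(canonical_equivalent):
    """Hash the characters of a canonical equivalent into a fixed-length signature."""
    return hashlib.sha256(canonical_equivalent.encode("utf-8")).hexdigest()


def add_to_database(canonical_equivalent, signature_db):
    """Generate the signature and store it; return it for convenience."""
    sig = make_signature(canonical_equivalent)
    signature_db.add(sig)
    return sig


signature_db = set()
stored = add_to_database("mailfrontier.com", signature_db)
```

Because only the signature is stored and compared, the database does not need to hold the contact point itself, which is the privacy property noted earlier for signatures.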

The database shown in this embodiment stores signatures that correspond to various acceptable contact points. Such a database is also used in the subsequent embodiments for the purposes of illustration. It should be noted that the acceptable contact points, other distinguishing properties and/or their signatures may be stored in the database in some embodiments.

FIG. 3 is a flow chart illustrating the classification of a message according to another embodiment. In this embodiment, each contact point of the message is tested and used to classify the message. Once the message is received (300), it is optionally determined whether the message includes any contact points (301). If the message does not include any contact points, the message may or may not be spam. Therefore, control is transferred to 312 to further process the message to classify it. If the message includes at least one contact point, the message is parsed and an attempt is made to extract the next contact point in the message (302). There may not be another contact point to be extracted from the message if all the distinguishing properties in the message have been processed already. Hence, in the next step, it is determined whether the next contact point is available (304). If there are no more distinguishing properties available, the test has concluded without finding any contact point in the message that does not already exist in the database. Therefore, the message is classified as acceptable (306).

If the next contact point is available, it is reduced to its canonical equivalent (307) and a signature is generated based on the canonical equivalent (308). It is then determined whether the signature exists in the database (310). If the signature does not exist in the database, there is a possibility that the message is spam and further processing is needed to classify the message (312). If, however, a signature exists in the database, it indicates that the contact point is acceptable and control is transferred to step 302 where the next contact point in the message is extracted and the process of generating and comparing the signature is repeated.
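The loop of FIG. 3 may be sketched as follows, with the step numbers noted in comments. The function name, the return labels, the default canonicalization (lower-casing) and the choice of SHA-256 as the signature function are all assumptions made for this illustration.

```python
import hashlib


def classify_per_contact_point(contact_points, signature_db, canonicalize=str.lower):
    """Test contact points one at a time and stop at the first unknown signature."""
    points = list(contact_points)
    if not points:                                   # optional check 301
        return "further-processing"                  # step 312
    for point in points:                             # steps 302 and 304
        canonical = canonicalize(point)              # step 307
        sig = hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # step 308
        if sig not in signature_db:                  # step 310
            return "further-processing"              # step 312
    # No contact point was missing from the database.
    return "acceptable"                              # step 306


known = {hashlib.sha256(b"mailfrontier.com").hexdigest()}
```

Returning as soon as one signature is missing avoids generating signatures for the remaining contact points, since the message can no longer be classified as acceptable by this test.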

For the message classification technique to be effective, the database should include as many signatures of acceptable contact points as possible, and exclude any signatures of contact points that may be distinguishing for spam messages. In some embodiments, the database is updated using a registration process. The registration process allows legitimate businesses or organizations to store contact points used in the messages they send to their customers or target audience at a central spam filtering location. The legitimacy of the organization is established using certificates such as the certificate issued by a certificate authority such as Verisign, an identifier or code issued by a central spam filtering authority, or any other appropriate certification mechanism that identifies the validity of an organization.

FIG. 4 is a flow chart illustrating a registration process for updating the database, according to one embodiment. Once a registration message is received (400), it is determined whether the certificate is valid (402). If the certificate is not valid, the message is ignored (404). In this embodiment, if the message certificate is valid, optional steps 405, 406 and 407 are performed. The classification of the message sender is obtained from the certificate (405). The message is then further tested using other spam determination techniques to determine whether it is spam (406). This optional step is used to prevent spammers from obtaining a valid certificate and adding their spam messages to the database. If the message is determined to be spam by these additional tests, control is transferred to step 404 and the message is ignored. If, however, the message is determined to be non-spam, one or more signatures are generated based on the contact points in the message (408). The signatures, sender classification, and other associated information for the message are then saved in the database (410).
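The registration flow of FIG. 4 may be sketched as follows. Certificate validation and the additional spam tests are outside the scope of this sketch and are passed in as caller-supplied functions; the message layout (a dictionary with "certificate" and "contact_points" keys), the "sender_class" field and the SHA-256 signature are hypothetical choices made only for illustration.

```python
import hashlib


def register_message(message, database, certificate_is_valid, looks_like_spam):
    """Store signatures for a registration message; return True if it was accepted."""
    certificate = message["certificate"]
    if not certificate_is_valid(certificate):           # step 402
        return False                                     # step 404: ignore
    sender_class = certificate.get("sender_class")       # step 405
    if looks_like_spam(message):                         # step 406
        return False                                     # step 404: ignore
    for point in message["contact_points"]:              # step 408
        sig = hashlib.sha256(point.encode("utf-8")).hexdigest()
        database[sig] = {"sender_class": sender_class}   # step 410
    return True


registry = {}
accepted = register_message(
    {"certificate": {"valid": True, "sender_class": "airline"},
     "contact_points": ["mailfrontier.com", "18005555555"]},
    registry,
    certificate_is_valid=lambda cert: cert.get("valid", False),
    looks_like_spam=lambda msg: False,
)
```

Keeping the sender classification alongside each signature is what later allows an organization to select only a subset of the registered entries.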

Different organizations or individuals may have different criteria for which messages are acceptable, and may only allow a subset of the registered signature database. In some embodiments, the signature database from the registration site is duplicated by individual organizations that wish to use the signature database for spam blocking purposes. The system administrators or end users are then able to customize their message filtering policies using the database entries. Using a policy allows some database entries to be selected for filtering purposes.

In some embodiments, the database is updated dynamically as messages are received, based on classifications made by the recipients. Preferably, the system allows for collaborative spam filtering where the response from other recipients in the system is incorporated into the message classification process. Different recipients of the same message may classify the message, and therefore the contact points in the message, differently. The same contact point may appear in a message that is classified as non-spam as well as a message that is classified as spam. The system aggregates the classification information of a contact point, and determines whether it should be added to the database of acceptable contact points.

FIG. 5 is a table used for aggregating user inputs, according to one system embodiment. The system extracts the contact points in the messages and generates their signatures. The state of each signature is tracked by three counters: acceptable, unacceptable, and unclassified, which are incremented whenever a message that includes the contact point is classified as non-spam, spam or unknown, respectively. A probability of being acceptable is computed by the system based on the counter values and updated periodically. A signature is added to the database once its probability of being acceptable exceeds a certain threshold. In some embodiments, the signature is removed from the database if its probability of being acceptable falls below the threshold.
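The counters of FIG. 5 may be sketched as follows. The description does not give the formula for the probability of being acceptable; the ratio of acceptable counts to all counts used here, the threshold value of 0.9, and all class and function names are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class SignatureState:
    """Per-signature counters corresponding to one row of the table."""
    acceptable: int = 0
    unacceptable: int = 0
    unclassified: int = 0

    def record(self, verdict):
        """Increment the counter matching a recipient's classification."""
        if verdict == "non-spam":
            self.acceptable += 1
        elif verdict == "spam":
            self.unacceptable += 1
        else:
            self.unclassified += 1

    def probability_acceptable(self):
        # Assumed formula: share of all observations that were non-spam.
        total = self.acceptable + self.unacceptable + self.unclassified
        return self.acceptable / total if total else 0.0


def refresh_database(sig, state, database, threshold=0.9):
    """Add the signature above the threshold and remove it below the threshold."""
    probability = state.probability_acceptable()
    if probability > threshold:
        database.add(sig)
    elif probability < threshold:
        database.discard(sig)


state = SignatureState()
db = set()
for verdict in ["non-spam"] * 19 + ["spam"]:
    state.record(verdict)
refresh_database("sig-1", state, db)      # 19 of 20 observations are non-spam
added_first = "sig-1" in db
for _ in range(5):
    state.record("spam")
refresh_database("sig-1", state, db)      # now 19 of 25 observations are non-spam
```

Under the assumed formula, the signature is added after the first twenty classifications and removed again once enough recipients report the contact point as spam.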

In some embodiments, the database is updated by post-processing previously stored messages. The messages are classified as spam or non-spam using spam classification techniques and/or previous user inputs. The contact points are extracted, their signatures generated, and their probabilities of being acceptable are computed. The signatures of the contact points that are likely to be acceptable are stored in the database.

An improved technique for classifying electronic messages has been disclosed. The technique uses distinguishing properties in a message and their corresponding signatures to classify the message and determine whether it is acceptable. In some embodiments, the distinguishing properties are contact points. A database of registered signatures is used in the classification process, and can be customized to suit the needs of individuals. The technique has low overhead, and is able to quickly and accurately determine non-spam messages.

Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. It should be noted that there are many alternative ways of implementing both the process and apparatus of the present invention. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

What is claimed:
 1. A method for classifying a message, the method comprising: storing information in a database in memory, the stored information regarding one or more previously generated signatures of previously received messages; receiving a message, the message including a distinguishing property; and executing instructions stored in a memory, wherein execution of the instructions by a processor: parses the message body to determine that the distinguishing property includes a contact point that is used to classify the message, reduces the contact point to an identifier that represents the contact point regardless of individual format of the contact point; generates a signature based on the identifier; compares the generated signature to the one or more previously generated signatures stored in the database; and updates a probability associated with the generated signature based on the comparison, wherein the generated signature is classified or reclassified based on the updated probability meeting a threshold.
 2. The method of claim 1, wherein the execution of instructions by the processor further classifies the message as acceptable based on the generated signature matching one of the previously generated signatures stored in the database, the matching previously generated signature having been classified as acceptable.
 3. The method of claim 2, wherein the execution of instructions by the processor further updates the database through a registration process.
 4. The method of claim 3, wherein the registration process includes: receiving a registration message; checking a certificate associated with the message, the certificate confirming that the registration message is from an acceptable source; extracting an item from the message; and adding an entry derived from the item to the database of acceptable items.
 5. The method of claim 1, wherein the execution of instructions by the processor further: parses the message to identify an additional contact point when the generated signature has not been previously stored in the database, reduces the contact point to an identifier that represents the additional contact point regardless of individual format of each of the additional contact point, generates a signature based on the identifier, and compares the generated signature of the identifier to one or more previously generated signatures stored in a database of acceptable signatures.
 6. The method of claim 1, wherein the contact point includes a universal resource locator (URL).
 7. The method of claim 1, wherein the contact point includes a phone number.
 8. The method of claim 1,wherein the contact point includes an address.
 9. The method of claim 1, wherein generating the signature includes performing a function on characters associated with the contact point, the function selected from a group consisting of a hash function, checksum, compression, and expansion.
 10. A system for classifying a message comprising: a memory for storing information in a database, the stored information regarding one or more previously generated signatures of previously received messages; an interface coupled to a mail server for receiving a message, the message including a distinguishing property; and a processor at the mail server for executing instructions stored in memory, wherein execution of the instructions by the processor: parses the message to determine that the distinguishing property includes a contact point that is used to classify the message; reduces the contact point to an identifier that represents the contact point regardless of individual format of the contact point; generates a signature based on the identifier; compares the generated signature to the one or more previously generated signatures stored in the database; and updates a probability associated with the generated signature based on the comparison, wherein the generated signature is classified or reclassified based on the updated probability meeting a threshold.
 11. The system of claim 10, wherein the processor further classifies the message as acceptable based on the generated signature matching one of the previously generated signatures stored in the database, the matching previously generated signature having been classified as acceptable.
 12. The system of claim 10, wherein the processor further updates the database through a registration process.
 13. The system of claim 12, wherein the processor performs the registration process, wherein the registration process comprises: receiving a registration message; checking a certificate associated with the message, the certificate confirming that the registration message is from an acceptable source; extracting an item from the message; and adding an entry derived from the item to the database of acceptable items.
 14. The system of claim 10, wherein the processor further executes instructions to: parse the message to identify an additional contact point when the generated signature has not been previously stored in the database, reduce the contact point to an identifier that represents the additional contact point regardless of individual format of each of the additional contact point, generate a signature based on the identifier, and compare the generated signature of the identifier to one or more previously generated signatures stored in a database of acceptable signatures.
 15. The system of claim 10, wherein the contact point includes a universal resource locator (URL).
 16. The system of claim 10, wherein the contact point includes a phone number.
 17. The system of claim 10, wherein the contact point includes an address.
 18. The system of claim 10, wherein the processor generates the signature by performing a function on characters associated with the contact point, the function selected from a group consisting of a hash function, checksum, compression, and expansion.
 19. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for classifying a message, the method comprising: storing information regarding one or more previously generated signatures of previously received messages; receiving a message, the message including a distinguishing property; parsing the message to determine that the distinguishing property includes a contact point that is used to classify the message; reducing the contact point to an identifier that represents the contact point regardless of individual format of the contact point; generating a signature based on the identifier; comparing the generated signature to the one or more previously generated signatures; and updating a probability associated with the generated signature based on the comparison, wherein the generated signature is classified or reclassified based on the updated probability meeting a threshold.